<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Non-English language support in Terrier</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" type="text/css" charset="utf-8" media="all" href="docs.css">
</head>

<body>
<!--!bodystart-->
[<a href="extend_retrieval.html">Previous: Extending Retrieval</a>] [<a href="index.html">Contents</a>] [<a href="dfr_description.html">Next: DFR Description</a>]
<table width="100%"><tr><td width="82%" valign="bottom">
<h1>Non-English language support in Terrier</h1></td>
<!--!bodyremove-->
<td width="18%"><a href="http://ir.dcs.gla.ac.uk/terrier/"><img src="images/terrier-logo-web.jpg" border="0"></a></td>
<!--!/bodyremove-->
</tr></table>

<h2>Index format</h2>
<p align="justify">When indexing documents in languages other than English, you should use the UTF index format. By default, Terrier assumes that indexed documents contain only terms without accents. Setting the property <tt>string.use_utf</tt> to <tt>true</tt> makes Terrier use the UTFLexicon, which overcomes this limitation by storing all terms in UTF-8.</p>
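<p align="justify">The following minimal sketch illustrates why storing terms in UTF-8 matters. It is not Terrier's lexicon code: the class <tt>Utf8TermSketch</tt> and the method <tt>roundTrip</tt> are hypothetical names used only for illustration, and the sketch relies solely on the standard Java library. It encodes an accented term to bytes and decodes it again, once with an ASCII-only charset and once with UTF-8.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: a hypothetical helper, not part of Terrier.
public class Utf8TermSketch {

    /** Encodes a term to bytes with the given charset, then decodes it again. */
    public static String roundTrip(String term, Charset charset) {
        // getBytes(Charset) replaces unmappable characters with the
        // charset's replacement byte, which is '?' for US-ASCII.
        byte[] stored = term.getBytes(charset);
        return new String(stored, charset);
    }

    public static void main(String[] args) {
        // "na\u00efve" is the word "naive" written with i-diaeresis (U+00EF).
        String term = "na\u00efve";

        String viaAscii = roundTrip(term, StandardCharsets.US_ASCII);
        String viaUtf8 = roundTrip(term, StandardCharsets.UTF_8);

        // The accented letter is lost with the ASCII-only charset.
        System.out.println("ASCII round trip preserved the term: " + viaAscii.equals(term));
        // UTF-8 can represent the accented letter, using two bytes for it.
        System.out.println("UTF-8 round trip preserved the term: " + viaUtf8.equals(term));
        System.out.println("UTF-8 byte length: " + term.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

<p align="justify">With the ASCII-only charset the accented letter comes back as <tt>?</tt>, so the stored term no longer matches the original. With UTF-8 the term survives unchanged, at the cost of one extra byte for the accented letter.</p>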

<h2>Collection &amp; Document support</h2>
<p align="justify"><a href="javadoc/uk/ac/gla/terrier/indexing/TRECCollection.html">TRECCollection</a> assumes that the only valid characters in terms are A-Z, a-z and 0-9. This assumption does not hold when indexing documents in languages other than English, so you should use a Collection object which supports other languages. In most cases, this should be <a href="javadoc/uk/ac/gla/terrier/indexing/TRECUTFCollection.html">TRECUTFCollection</a>. Select it by setting the property <tt>trec.collection.class=TRECUTFCollection</tt>. (TRECUTFCollection uses <tt>Character.isLetterOrDigit()</tt> to determine term boundaries.)</p>
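<p align="justify">The difference between the two term-boundary rules can be sketched as follows. This is a simplified illustration, not the actual TRECCollection or TRECUTFCollection code: the class <tt>TermBoundarySketch</tt> and its methods are hypothetical, and only <tt>Character.isLetterOrDigit()</tt> is taken from the description above.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hypothetical tokeniser, not Terrier's implementation.
public class TermBoundarySketch {

    /** ASCII-only rule: A-Z, a-z and 0-9 are the only term characters. */
    public static boolean isAsciiTermChar(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    /**
     * Splits text into terms. When unicodeAware is true, term characters are
     * decided by Character.isLetterOrDigit(); otherwise by the ASCII-only rule.
     */
    public static List<String> tokenise(String text, boolean unicodeAware) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean inTerm = unicodeAware ? Character.isLetterOrDigit(c) : isAsciiTermChar(c);
            if (inTerm) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-term character ends the current term.
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        // "Un cafe a Malaga" with accents, written with Unicode escapes
        // so that the source file encoding does not matter.
        String text = "Un caf\u00e9 \u00e0 M\u00e1laga";
        System.out.println("ASCII rule:   " + tokenise(text, false));
        System.out.println("Unicode rule: " + tokenise(text, true));
    }
}
```

<p align="justify">Under the ASCII-only rule, the sample text is split into the terms Un, caf, M and laga, because every accented letter is treated as a term boundary. Under the <tt>Character.isLetterOrDigit()</tt> rule, the terms are Un, caf&eacute;, &agrave; and M&aacute;laga.</p>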

<p>Note that the document classes used by Desktop Terrier, such as FileDocument and HTMLDocument, do not yet support other languages.
</p>

<h2>Stemmers</h2>
<p align="justify">Starting with Terrier 1.1.1, we have included all stemmers from the <a href="http://snowball.tartarus.org/">Snowball</a> project. Currently, this means that the following stemmers can be applied from Terrier:</p>
<ul>
<li><a href="javadoc/uk/ac/gla/terrier/terms/DanishSnowballStemmer.html">DanishSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/DutchSnowballStemmer.html">DutchSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/EnglishSnowballStemmer.html">EnglishSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/FinnishSnowballStemmer.html">FinnishSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/FrenchSnowballStemmer.html">FrenchSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/GermanSnowballStemmer.html">GermanSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/HungarianSnowballStemmer.html">HungarianSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/ItalianSnowballStemmer.html">ItalianSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/NorwegianSnowballStemmer.html">NorwegianSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/PortugueseSnowballStemmer.html">PortugueseSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/RomanianSnowballStemmer.html">RomanianSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/RussianSnowballStemmer.html">RussianSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/SpanishSnowballStemmer.html">SpanishSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/SwedishSnowballStemmer.html">SwedishSnowballStemmer</a></li>
<li><a href="javadoc/uk/ac/gla/terrier/terms/TurkishSnowballStemmer.html">TurkishSnowballStemmer</a></li>
</ul>

<h2>Batch Retrieval</h2>
<p align="justify">When experimenting with topics in languages other than English, Terrier will use a suitable topic file tokeniser if the property <tt>string.use_utf</tt> is set to <tt>true</tt>.</p>

[<a href="extend_retrieval.html">Previous: Extending Retrieval</a>] [<a href="index.html">Contents</a>] [<a href="dfr_description.html">Next: DFR Description</a>]
<!--!bodyend-->
<hr>
<small>
Webpage: <a href="http://ir.dcs.gla.ac.uk/terrier">http://ir.dcs.gla.ac.uk/terrier</a><br>
Contact: <a href="mailto:terrier@dcs.gla.ac.uk">terrier@dcs.gla.ac.uk</a><br>
<a href="http://www.dcs.gla.ac.uk/">Department of Computing Science</a><br>

Copyright (C) 2004-2008 <a href="http://www.gla.ac.uk/">University of Glasgow</a>. All Rights Reserved.
</small>
</body>
</html>
